How To Protect Your Files From Robots
by Erika Lawal
Optimizing website pages for the search engines without running
into trouble at the very least causes most of us webmasters to
keep our brain cells finely honed, and at worst induces massive
migraines!
One of the most common challenges for us all is how to present
"clean", relevant and original content to a wide range of
visitors.
You may find that you want to exclude search engine and other
robots from all or part of your website for a number of reasons,
including:
- you want to write similar pages for different types of
visitors, but don't want to be penalized for duplication.
- you want to prepare pages or files that you don't want
viewed.
It's very easy to achieve this by one of two means: you can use
either a robots.txt file or a meta tag.
Let's de-mystify the process of writing these files and tags!
WHAT IS A ROBOTS.TXT FILE?
A robots.txt file is a set of instructions to the robots that
travel the web, spidering the pages they find there. It tells
them which parts of your site they may traverse, if any.
The robots.txt file we're considering here is an exclusion
instruction - think of it as a "no entry" sign to robots.
You can write a file to exclude ("disallow") robots from all, or
just part of your site.
Before you begin, you need to know how to write the .txt file.
Prepare it in a plain text editor such as Notepad. Don't attempt
it in Word or an HTML editor such as FrontPage. When you're
finished, save it as "robots.txt" (all lower case).
WHAT TO PUT IN YOUR ROBOTS.TXT FILE
If you want to disallow all robots, you'd write:
User-agent: *
Disallow: /
And that's all. Nothing else.
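If you'd like to check that a rule does what you expect before uploading it, Python's standard urllib.robotparser module can parse the file and report what a robot would be allowed to fetch. This is just a convenient testing aid of mine, not part of the method above; example.com stands in for your own domain:

```python
from urllib import robotparser

# The "no entry" file described above: exclude all
# robots from the entire site.
rules = """\
User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# With "Disallow: /", every path is off-limits to every robot.
print(rp.can_fetch("AnyBot", "http://example.com/"))           # False
print(rp.can_fetch("AnyBot", "http://example.com/page.html"))  # False
```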
What about if you only want to exclude part of your site?
Let's pretend you're running a website which advises on raising
children. Your material will be relevant to surfers who live in
many countries, but if you want them to really sit up and look,
especially if you want them to buy from you, you'll need to make
sure that your content is region-specific, including references,
idiom and spelling.
This situation is an ideal candidate for a robots exclusion .txt
file.
You've written all the pages you want to show to surfers in
Canada, UK, and Australia in 3 separate directories which
surfers will access by clicking on an appropriate link on your
main pages.
The directories are:
/ca/
/uk/
/au/
To disallow robots from these directories, write the following
.txt file:
User-agent: *
Disallow: /ca/
Disallow: /uk/
Disallow: /au/
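You can sanity-check this file the same way with Python's standard urllib.robotparser module (again, my testing aid rather than part of the article's method, with example.com as a stand-in domain):

```python
from urllib import robotparser

# The three regional directories are disallowed;
# everything else on the site remains open to robots.
rules = """\
User-agent: *
Disallow: /ca/
Disallow: /uk/
Disallow: /au/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("AnyBot", "http://example.com/ca/index.html"))  # False
print(rp.can_fetch("AnyBot", "http://example.com/uk/index.html"))  # False
print(rp.can_fetch("AnyBot", "http://example.com/index.html"))     # True
```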
It may be that you want to allow some robots and disallow
others. In our example, suppose you want to disallow just one
robot from one directory, in which case you'd write:
User-agent: NastyBot
Disallow: /ca/
Or, to exclude all robots except one, which you want to traverse
all of your site:
User-agent: NiceBot
Disallow:
User-agent: *
Disallow: /ca/
Note that a Disallow line with no slash after it means the robot
named in that record is permitted to read the whole site, and
"*" matches any robot. So in the last .txt file example, all
robots are excluded from your Canadian directory except NiceBot,
which can read the whole site.
Easy, isn't it?
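You can verify that the two-record file behaves as described, with NiceBot roaming freely while everyone else is kept out of /ca/, using the same urllib.robotparser check (a sketch, with example.com as a stand-in domain):

```python
from urllib import robotparser

rules = """\
User-agent: NiceBot
Disallow:

User-agent: *
Disallow: /ca/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# NiceBot matches its own record; the empty Disallow
# line permits the whole site.
print(rp.can_fetch("NiceBot", "http://example.com/ca/page.html"))   # True

# Every other robot falls through to the "*" record.
print(rp.can_fetch("OtherBot", "http://example.com/ca/page.html"))  # False
print(rp.can_fetch("OtherBot", "http://example.com/index.html"))    # True
```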
WHERE TO PUT YOUR ROBOTS.TXT FILE
Once created, your file needs to go into your root directory.
This is the same directory which contains your home page. Don't
put it anywhere else, because the robots won't see it.
Note that you can only have ONE robots.txt file per site, so any
modifications will need to be integrated into your original file.
Note also that disallowing pages in a robots.txt file means
robots won't read those pages, and so they won't be indexed, but
that won't matter if you've optimized your indexed pages
properly.
In our Ca/UK/Au example above, your traffic will find your
indexed global/US pages via the search engines, and will then
follow the link to their "nationality" page from their point of
entry to your site. We've all seen the little flag links on
other sites; just put up a flag graphic and say, for example,
"UK Visitors Click Here".
If you want to learn more about exclusion robots.txt files,
visit:
http://www.robotstxt.org/wc/exclusion-admin.html
If you prefer (or need) to exclude individual pages from being
viewed by robots, you can do this using a robots.txt file, but
you can also achieve it with a meta tag placed between the
<head> tags of your web page. The universal exclusion is as
follows:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
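If you want to confirm the tag is actually present in a page's head section, a small check with Python's standard html.parser module can pull out any robots meta directives. This is an illustration of mine, not something the article prescribes:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the CONTENT value of any <meta name="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attribute names arrive lower-cased
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            self.directives.append(attrs.get("content", ""))

# A minimal example page carrying the universal exclusion tag.
page = """<html><head>
<title>Example page</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head><body>Hello</body></html>"""

parser = RobotsMetaParser()
parser.feed(page)
print(parser.directives)  # ['NOINDEX, NOFOLLOW']
```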
It may be that you want robots to index your pages, but not to
archive them. There is a range of reasons why you might not want
search engines to keep copies of old pages. The most prevalent
one among webmasters is that they are cloaking pages and don't
want it known that the page served to search engines is
different from the one seen by surfers, but it's also possible
to have perfectly "legitimate" reasons for wanting to keep parts
of your site from public scrutiny.
Whatever your reason, if you want to avoid your page being
archived, the universal tag is:
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
For Google (whose cache feature makes it the search engine you
are most likely to want to stop archiving your pages), the tag
is:
<META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">
To learn more about exclusion meta tags, visit:
http://www.robotstxt.org/wc/exclusion.html#meta
Don't be put off by the jargon; writing these files and tags
is one of the easiest and most useful technical tasks you can
undertake as a webmaster - write a file today and save yourself
hundreds of hours!
================================================================
Erika Lawal writes Daily Internet Marketing Tips for webmasters
desperately in search of cutting edge site optimization and
marketing advice that produces results. Get a FREE series of
our Tips by visiting:
http://www.dailyinternetmarketingtips.com/spronews.html
================================================================